contents

  

Front matter

foreword

preface

acknowledgments

about this book

about the authors

about the cover illustration

  

  1   An overview of machine learning and deep learning

  1.1   A first look at machine/deep learning: A paradigm shift in computation

  1.2   A function approximation view of machine learning: Models and their training

  1.3   A simple machine learning model: The cat brain

Input features

Output decisions

Model estimation

Model architecture selection

Model training

Inferencing

  1.4   Geometrical view of machine learning

  1.5   Regression vs. classification in machine learning

  1.6   Linear vs. nonlinear models

  1.7   Higher expressive power through multiple nonlinear layers: Deep neural networks

  2   Vectors, matrices, and tensors in machine learning

  2.1   Vectors and their role in machine learning

The geometric view of vectors and its significance in machine learning

  2.2   PyTorch code for vector manipulations

PyTorch code for the introduction to vectors

  2.3   Matrices and their role in machine learning

Matrix representation of digital images

  2.4   Python code: Introducing matrices, tensors, and images via PyTorch

  2.5   Basic vector and matrix operations in machine learning

Matrix and vector transpose

Dot product of two vectors and its role in machine learning

Matrix multiplication and machine learning

Length of a vector (L2 norm): Model error

Geometric intuitions for vector length

Geometric intuitions for the dot product: Feature similarity

  2.6   Orthogonality of vectors and its physical significance

  2.7   Python code: Basic vector and matrix operations via PyTorch

PyTorch code for a matrix transpose

PyTorch code for a dot product

PyTorch code for matrix-vector multiplication

PyTorch code for matrix-matrix multiplication

PyTorch code for the transpose of a matrix product

  2.8   Multidimensional line and plane equations and machine learning

Multidimensional line equation

Multidimensional planes and their role in machine learning

  2.9   Linear combinations, vector spans, basis vectors, and collinearity preservation

Linear dependence

Span of a set of vectors

Vector spaces, basis vectors, and closure

  2.10 Linear transforms: Geometric and algebraic interpretations

Generic multidimensional definition of linear transforms

All matrix-vector multiplications are linear transforms

  2.11 Multidimensional arrays, multilinear transforms, and tensors

Array view: Multidimensional arrays of numbers

  2.12 Linear systems and matrix inverse

Linear systems with zero or near-zero determinants, and ill-conditioned systems

PyTorch code for inverse, determinant, and singularity testing of matrices

Over- and underdetermined linear systems in machine learning

Moore-Penrose pseudo-inverse of a matrix

Pseudo-inverse of a matrix: A beautiful geometric intuition

PyTorch code to solve overdetermined systems

  2.13 Eigenvalues and eigenvectors: Swiss Army knives of machine learning

Eigenvectors and linear independence

Symmetric matrices and orthogonal eigenvectors

PyTorch code to compute eigenvectors and eigenvalues

  2.14 Orthogonal (rotation) matrices and their eigenvalues and eigenvectors

Rotation matrices

Orthogonality of rotation matrices

PyTorch code for orthogonality of rotation matrices

Eigenvalues and eigenvectors of a rotation matrix: Finding the axis of rotation

PyTorch code for eigenvalues and eigenvectors of rotation matrices

  2.15 Matrix diagonalization

PyTorch code for matrix diagonalization

Solving linear systems without inversion via diagonalization

PyTorch code for solving linear systems via diagonalization

Matrix powers using diagonalization

  2.16 Spectral decomposition of a symmetric matrix

PyTorch code for the spectral decomposition of a matrix

  2.17 An application relevant to machine learning: Finding the axes of a hyperellipse

PyTorch code for hyperellipses

  3   Classifiers and vector calculus

  3.1   Geometrical view of image classification

Input representation

Classifiers as decision boundaries

Modeling in a nutshell

Sign of the surface function in binary classification

  3.2   Error, aka loss function

  3.3   Minimizing loss functions: Gradient vectors

Gradients: A machine learning-centric introduction

Level surface representation and loss minimization

  3.4   Local approximation for the loss function

1D Taylor series recap

Multidimensional Taylor series and the Hessian matrix

  3.5   PyTorch code for gradient descent, error minimization, and model training

PyTorch code for linear models

Autograd: PyTorch automatic gradient computation

Nonlinear models in PyTorch

A linear model for the cat brain in PyTorch

  3.6   Convex and nonconvex functions, and global and local minima

  3.7   Convex sets and functions

Convex sets

Convex curves and surfaces

Convexity and the Taylor series

Examples of convex functions

  4   Linear algebraic tools in machine learning

  4.1   Distribution of feature data points and true dimensionality

  4.2   Quadratic forms and their minimization

Minimizing quadratic forms

Symmetric positive (semi)definite matrices

  4.3   Spectral and Frobenius norms of a matrix

Spectral norms

Frobenius norms

  4.4   Principal component analysis

Direction of maximum spread

PCA and dimensionality

PyTorch code: PCA and dimensionality reduction

Limitations of PCA

PCA and data compression

  4.5   Singular value decomposition

Informal proof of the SVD theorem

Proof of the SVD theorem

Applying SVD: PCA computation

Applying SVD: Solving arbitrary linear systems

Rank of a matrix

PyTorch code for solving linear systems with SVD

PyTorch code for PCA computation via SVD

Applying SVD: Best low-rank approximation of a matrix

  4.6   Machine learning application: Document retrieval

Using TF-IDF and cosine similarity

Latent semantic analysis

PyTorch code to perform LSA

PyTorch code to compute LSA and SVD on a large dataset

  5   Probability distributions in machine learning

  5.1   Probability: The classical frequentist view

Random variables

Population histograms

  5.2   Probability distributions

  5.3   Basic concepts of probability theory

Probabilities of impossible and certain events

Exhaustive and mutually exclusive events

Independent events

  5.4   Joint probabilities and their distributions

Marginal probabilities

Dependent events and their joint probability distribution

  5.5   Geometrical view: Sample point distributions for dependent and independent variables

  5.6   Continuous random variables and probability density

  5.7   Properties of distributions: Expected value, variance, and covariance

Expected value (aka mean)

Variance, covariance, and standard deviation

  5.8   Sampling from a distribution

  5.9   Some famous probability distributions

Uniform random distributions

Gaussian (normal) distribution

Binomial distribution

Multinomial distribution

Bernoulli distribution

Categorical distribution and one-hot vectors

  6   Bayesian tools for machine learning

  6.1   Conditional probability and Bayes’ theorem

Joint and marginal probability revisited

Conditional probability

Bayes’ theorem

  6.2   Entropy

Geometrical intuition for entropy

Entropy of Gaussians

  6.3   Cross-entropy

  6.4   KL divergence

KLD between Gaussians

  6.5   Conditional entropy

Chain rule of conditional entropy

  6.6   Model parameter estimation

Likelihood, evidence, and posterior and prior probabilities

Maximum likelihood parameter estimation (MLE)

Maximum a posteriori (MAP) parameter estimation and regularization

  6.7   Latent variables and evidence maximization

  6.8   Maximum likelihood parameter estimation for Gaussians

PyTorch code for maximum likelihood estimation

PyTorch code for maximum likelihood estimation using gradient descent

  6.9   Gaussian mixture models

Probability density function of the GMM

Latent variables for class selection

Classification via GMM

Maximum likelihood estimation of GMM parameters (GMM fit)

  7   Function approximation: How neural networks model the world

  7.1   Neural networks: A 10,000-foot view

  7.2   Expressing real-world problems: Target functions

Logical functions in real-world problems

Classifier functions in real-world problems

General functions in real-world problems

  7.3   The basic building block or neuron: The perceptron

The Heaviside step function

Hyperplanes

Perceptrons and classification

Modeling common logic gates with perceptrons

  7.4   Toward more expressive power: Multilayer perceptrons (MLPs)

MLP for logical XOR

  7.5   Layered networks of perceptrons: MLPs or neural networks

Layering

Modeling logical functions with MLPs

Cybenko’s universal approximation theorem

MLPs for polygonal decision boundaries

  8   Training neural networks: Forward propagation and backpropagation

  8.1   Differentiable step-like functions

Sigmoid function

Tanh function

  8.2   Why layering?

  8.3   Linear layers

Linear layers expressed as matrix-vector multiplication

Forward propagation and grand output functions for an MLP of linear layers

  8.4   Training and backpropagation

Loss and its minimization: Goal of training

Loss surface and gradient descent

Why a gradient provides the best direction for descent

Gradient descent and local minima

The backpropagation algorithm

Putting it all together: Overall training algorithm

  8.5   Training a neural network in PyTorch

  9   Loss, optimization, and regularization

  9.1   Loss functions

Quantification and geometrical view of loss

Regression loss

Cross-entropy loss

Binary cross-entropy loss for image and vector mismatches

Softmax

Softmax cross-entropy loss

Focal loss

Hinge loss

  9.2   Optimization

Geometrical view of optimization

Stochastic gradient descent and minibatches

PyTorch code for SGD

Momentum

Geometric view: Constant loss contours, gradient descent, and momentum

Nesterov accelerated gradients

AdaGrad

Root-mean-square propagation

Adam optimizer

  9.3   Regularization

Minimum description length: An Occam's razor view of optimization

L2 regularization

L1 regularization

Sparsity: L1 vs. L2 regularization

Bayes’ theorem and the stochastic view of optimization

Dropout

10   Convolutions in neural networks

10.1   One-dimensional convolution: Graphical and algebraic view

Curve smoothing via 1D convolution

Curve edge detection via 1D convolution

One-dimensional convolution as matrix multiplication

PyTorch: One-dimensional convolution with custom weights

10.2   Convolution output size

10.3   Two-dimensional convolution: Graphical and algebraic view

Image smoothing via 2D convolution

Image edge detection via 2D convolution

PyTorch: 2D convolution with custom weights

Two-dimensional convolution as matrix multiplication

10.4   Three-dimensional convolution

Video motion detection via 3D convolution

PyTorch: Three-dimensional convolution with custom weights

10.5   Transposed convolution or fractionally strided convolution

Application of transposed convolution: Autoencoders and embeddings

Transposed convolution output size

Upsampling via transposed convolution

10.6   Adding convolution layers to a neural network

PyTorch: Adding convolution layers to a neural network

10.7   Pooling

11   Neural networks for image classification and object detection

11.1   CNNs for image classification: LeNet

PyTorch: Implementing LeNet for image classification on MNIST

11.2   Toward deeper neural networks

VGG (Visual Geometry Group) Net

Inception: Network-in-network paradigm

ResNet: Why stacking layers to add depth does not scale

PyTorch Lightning

11.3   Object detection: A brief history

R-CNN

Fast R-CNN

Faster R-CNN

11.4   Faster R-CNN: A deep dive

Convolutional backbone

Region proposal network

Fast R-CNN

Training the Faster R-CNN

Other object-detection paradigms

12   Manifolds, homeomorphism, and neural networks

12.1   Manifolds

Hausdorff property

Second countable property

12.2   Homeomorphism

12.3   Neural networks and homeomorphism between manifolds

13   Fully Bayes model parameter estimation

13.1   Fully Bayes estimation: An informal introduction

Parameter estimation and belief injection

13.2   MLE for Gaussian parameter values (recap)

13.3   Fully Bayes parameter estimation: Gaussian, unknown mean, known precision

13.4   Small and large volumes of training data, and strong and weak priors

13.5   Conjugate priors

13.6   Fully Bayes parameter estimation: Gaussian, unknown precision, known mean

Estimating the precision parameter

13.7   Fully Bayes parameter estimation: Gaussian, unknown mean, unknown precision

Normal-gamma distribution

Estimating the mean and precision parameters

13.8   Example: Fully Bayesian inferencing

Maximum likelihood estimation

Bayesian inference

13.9   Fully Bayes parameter estimation: Multivariate Gaussian, unknown mean, known precision

13.10 Fully Bayes parameter estimation: Multivariate Gaussian, unknown precision, known mean

Wishart distribution

Estimating precision

14   Latent space and generative modeling, autoencoders, and variational autoencoders

14.1   Geometric view of latent spaces

14.2   Generative classifiers

14.3   Benefits and applications of latent-space modeling

14.4   Linear latent space manifolds and PCA

PyTorch code for dimensionality reduction using PCA

14.5   Autoencoders

Autoencoders and PCA

14.6   Smoothness, continuity, and regularization of latent spaces

14.7   Variational autoencoders

Geometric overview of VAEs

VAE training, losses, and inferencing

VAEs and Bayes’ theorem

Stochastic mapping leads to latent-space smoothness

Direct minimization of the posterior requires prohibitively expensive normalization

ELBO and VAEs

Choice of prior: Zero-mean, unit-covariance Gaussian

Reparameterization trick

  

appendix

  

notations

  

index
